{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>第十章 多元分析<center/>\n",
    "## <center>第二节主成分分析<center/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**介绍**  \n",
    "这里是司守奎教授的《数学建模算法与应用》全书案例代码python实现，欢迎加入此项目将其案例代码用python实现  \n",
    "GitHub项目地址：[Mathematical-modeling-algorithm-and-Application](https://github.com/STL-CC/Mathematical-modeling-algorithm-and-Application)  \n",
    "CSDN专栏：[数学建模](https://blog.csdn.net/stl_cc/category_10228778.html)  \n",
    "知乎专栏：[数学建模算法与应用](https://zhuanlan.zhihu.com/c_1271013077337964544)  \n",
    "**联系作者**  \n",
    "作者：STL_CC  \n",
    "邮箱：<1459078309@qq.com>  \n",
    "\n",
    "由于作者还是大一学生，才疏学浅，难免会有错误，欢迎指正  \n",
    "同时作者精力有限，希望更多大佬加入此项目，一来可以提高建模水平，二来可以分享建模经验"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **主成分回归分析**\n",
    "**例10.5**  \n",
    "Hald水泥问题，考察含如下四种化学成分  \n",
    "$x_1$=$3CaO.Al_2O_3$的含量（%），$x_2$=$3CaO.SiO_2$的含量（%）  \n",
    "$x_3$=$4CaO.Al_2O_3.Fe_2O_3$的含量（%），$x_4$=$2CaO.SiO_2$的含量（%）  \n",
    "的某种水泥，每一克所释放出的热量（卡） y 与这四种成分含量之间的关系数据共13组，见表7，对数据实施标准化，则 X T X /12就是样本相关系数阵（见表8）。  \n",
    "注：表格见教材"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import statsmodels.api as sm\n",
    "from scipy.stats import zscore\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sn=pd.read_csv('sn.txt',header=None,sep='\t')\n",
    "m,n=sn.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>26</td>\n",
       "      <td>6</td>\n",
       "      <td>60</td>\n",
       "      <td>78.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>29</td>\n",
       "      <td>15</td>\n",
       "      <td>52</td>\n",
       "      <td>74.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>11</td>\n",
       "      <td>56</td>\n",
       "      <td>8</td>\n",
       "      <td>20</td>\n",
       "      <td>104.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>11</td>\n",
       "      <td>31</td>\n",
       "      <td>8</td>\n",
       "      <td>47</td>\n",
       "      <td>87.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>52</td>\n",
       "      <td>6</td>\n",
       "      <td>33</td>\n",
       "      <td>95.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>11</td>\n",
       "      <td>55</td>\n",
       "      <td>9</td>\n",
       "      <td>22</td>\n",
       "      <td>109.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>71</td>\n",
       "      <td>17</td>\n",
       "      <td>6</td>\n",
       "      <td>102.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>31</td>\n",
       "      <td>22</td>\n",
       "      <td>44</td>\n",
       "      <td>72.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>54</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>93.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>21</td>\n",
       "      <td>47</td>\n",
       "      <td>4</td>\n",
       "      <td>26</td>\n",
       "      <td>115.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>40</td>\n",
       "      <td>23</td>\n",
       "      <td>34</td>\n",
       "      <td>83.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>66</td>\n",
       "      <td>9</td>\n",
       "      <td>12</td>\n",
       "      <td>113.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>12</td>\n",
       "      <td>10</td>\n",
       "      <td>68</td>\n",
       "      <td>8</td>\n",
       "      <td>12</td>\n",
       "      <td>109.4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0   1   2   3      4\n",
       "0    7  26   6  60   78.5\n",
       "1    1  29  15  52   74.3\n",
       "2   11  56   8  20  104.3\n",
       "3   11  31   8  47   87.6\n",
       "4    7  52   6  33   95.9\n",
       "5   11  55   9  22  109.2\n",
       "6    3  71  17   6  102.7\n",
       "7    1  31  22  44   72.5\n",
       "8    2  54  18  22   93.1\n",
       "9   21  47   4  26  115.9\n",
       "10   1  40  23  34   83.8\n",
       "11  11  66   9  12  113.3\n",
       "12  10  68   8  12  109.4"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "x0=np.array(sn.iloc[:,0:n-1])\n",
    "y0=np.array(sn.iloc[:,n-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   R-squared:                       0.982\n",
      "Model:                            OLS   Adj. R-squared:                  0.974\n",
      "Method:                 Least Squares   F-statistic:                     111.5\n",
      "Date:                Fri, 31 Jul 2020   Prob (F-statistic):           4.76e-07\n",
      "Time:                        01:24:42   Log-Likelihood:                -26.918\n",
      "No. Observations:                  13   AIC:                             63.84\n",
      "Df Residuals:                       8   BIC:                             66.66\n",
      "Df Model:                           4                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const         62.4054     70.071      0.891      0.399     -99.179     223.989\n",
      "x1             1.5511      0.745      2.083      0.071      -0.166       3.269\n",
      "x2             0.5102      0.724      0.705      0.501      -1.159       2.179\n",
      "x3             0.1019      0.755      0.135      0.896      -1.638       1.842\n",
      "x4            -0.1441      0.709     -0.203      0.844      -1.779       1.491\n",
      "==============================================================================\n",
      "Omnibus:                        0.165   Durbin-Watson:                   2.053\n",
      "Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.320\n",
      "Skew:                           0.201   Prob(JB):                        0.852\n",
      "Kurtosis:                       2.345   Cond. No.                     6.06e+03\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "[2] The condition number is large, 6.06e+03. This might indicate that there are\n",
      "strong multicollinearity or other numerical problems.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Program Files\\anaconda\\lib\\site-packages\\scipy\\stats\\stats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13\n",
      "  \"anyway, n=%i\" % int(n))\n"
     ]
    }
   ],
   "source": [
    "results = sm.OLS(y0, np.c_[np.ones((m,1)),x0]).fit()\n",
    "print(results.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "r=np.corrcoef(x0.T)\n",
    "xd=zscore(x0)\n",
    "yd=zscore(y0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,\n",
       "    svd_solver='auto', tol=0.0, whiten=False)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca=PCA()\n",
    "pca.fit(xd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([5.58926009e-01, 3.94016518e-01, 4.66515373e-02, 4.05936433e-04])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.explained_variance_ratio_#计算各个变量的贡献率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num=3#由于我们使用jupyter演示程序，可能不好交互式\n",
    "pca_=PCA(n_components=num)\n",
    "xd_=pca_.fit_transform(xd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   R-squared:                       0.982\n",
      "Model:                            OLS   Adj. R-squared:                  0.976\n",
      "Method:                 Least Squares   F-statistic:                     164.9\n",
      "Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08\n",
      "Time:                        01:24:42   Log-Likelihood:                 7.7143\n",
      "No. Observations:                  13   AIC:                            -7.429\n",
      "Df Residuals:                       9   BIC:                            -5.169\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101\n",
      "x1            -0.6570      0.030    -22.045      0.000      -0.724      -0.590\n",
      "x2             0.0083      0.035      0.234      0.820      -0.072       0.089\n",
      "x3             0.3028      0.103      2.935      0.017       0.069       0.536\n",
      "==============================================================================\n",
      "Omnibus:                        0.246   Durbin-Watson:                   1.943\n",
      "Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416\n",
      "Skew:                           0.162   Prob(JB):                        0.812\n",
      "Kurtosis:                       2.186   Cond. No.                         3.46\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Program Files\\anaconda\\lib\\site-packages\\scipy\\stats\\stats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13\n",
      "  \"anyway, n=%i\" % int(n))\n"
     ]
    }
   ],
   "source": [
    "results = sm.OLS(yd, np.c_[np.ones((m,1)),xd_]).fit()\n",
    "print(results.summary())\n",
    "#原MATLAB程序中还做了使系数全为正的处理，在此报告中，x1为负，但是其实只要转成正的就行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "xback=np.dot(xd_,pca.components_[:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   R-squared:                       0.982\n",
      "Model:                            OLS   Adj. R-squared:                  0.976\n",
      "Method:                 Least Squares   F-statistic:                     164.9\n",
      "Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08\n",
      "Time:                        01:24:43   Log-Likelihood:                 7.7143\n",
      "No. Observations:                  13   AIC:                            -7.429\n",
      "Df Residuals:                       9   BIC:                            -5.169\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101\n",
      "x1             0.5130      0.073      6.992      0.000       0.347       0.679\n",
      "x2             0.2787      0.039      7.078      0.000       0.190       0.368\n",
      "x3            -0.0608      0.070     -0.866      0.409      -0.220       0.098\n",
      "x4            -0.4229      0.030    -13.871      0.000      -0.492      -0.354\n",
      "==============================================================================\n",
      "Omnibus:                        0.246   Durbin-Watson:                   1.943\n",
      "Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416\n",
      "Skew:                           0.162   Prob(JB):                        0.812\n",
      "Kurtosis:                       2.186   Cond. No.                     8.35e+15\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "[2] The smallest eigenvalue is 4.17e-31. This might indicate that there are\n",
      "strong multicollinearity problems or that the design matrix is singular.\n"
     ]
    }
   ],
   "source": [
    "results = sm.OLS(yd, np.c_[np.ones((m,1)),xback]).fit()\n",
    "print(results.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**说明**  \n",
    "1.在教材原来的matlab程序中，OLS普通最小二乘法系数他是用朴素的方法计算出来的，在本文中，借助python强大的拓展能力，笔者使用了statsmodel统计模型库中的OLS模型进行建模。它不仅能够返回相关系数，更重要的是他可以返回与之相关的各项参数并生成报表。   \n",
    "2.在matlab中，相关系数矩阵计算使用的corrcoef列指标是变量，然而np.corrcoef行指标是向量，所以要先转置。  \n",
    "3.PCA模型在python的sklearn机器学习库中，相比比较裸的MATLAB工具，它提供了更多强大的功能，可以去sklearn文档探索更多，这里写法会和MATLAB写法有所不同。\n",
    "4.协方差矩阵和相关系数矩阵易混：\n",
    "* 相关系数矩阵:相当于消除量纲的表示变量间相关性的一个矩阵\n",
    "* 协方差矩阵:它是没有消除量纲的表示变量间相关性的矩阵  \n",
    "* r=COV(x,y)/D(x)D(y)\n",
    "\n",
    "**参考文献**  \n",
    "1.[python中使用多个模块使用普通最小二乘法回归](https://www.jianshu.com/p/ded62d82a787)   \n",
    "2.[最小二乘法原理](https://blog.csdn.net/m0_38075425/article/details/90738415)  \n",
    "3.[机器学习中的降维算法](https://blog.csdn.net/github_38486975/article/details/88384884)  \n",
    "4.[浅谈方差、协方差矩阵、相关系数矩阵](https://blog.csdn.net/scpcmoon/article/details/80549059)  \n",
    "5.[用scikit-learn学习主成分分析(PCA)](https://www.cnblogs.com/pinard/p/6243025.html)  \n",
    "6.[sklearn中PCA官方文档](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)  \n",
    "7.[使用sklearn进行对数据标准化、归一化以及将数据还原的方法](https://www.jb51.net/article/143571.htm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **案例研究**  \n",
    "**主成分分析案例－我国各地区普通高等教育发展水平综合评价**  \n",
    "主成分分析试图在力保数据信息丢失最少的原则下，对多变量的截面数据表进行最佳综合简化，也就是说，对高维变量空间进行降维处理。本案例运用主成分分析方法综合评价我国各地区普通高等教育的发展水平。  \n",
    "具体资料见教材，部分数据将在程序中给出"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import statsmodels.api as sm\n",
    "from scipy.stats import zscore\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "gj=pd.read_csv('gj.txt',header=None,sep='\t')\n",
    "gj=zscore(gj)\n",
    "r=np.corrcoef(gj.T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([7.50215857e-01, 1.57698725e-01, 5.36213464e-02, 2.06379017e-02,\n",
       "       1.45001273e-02, 2.21867124e-03, 7.12026644e-04, 2.65835385e-04,\n",
       "       7.25914518e-05, 5.69183083e-05])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca=PCA()\n",
    "pca.fit(gj)\n",
    "pca.explained_variance_ratio_#计算各个变量的得分"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
